I am exploring the white wine data set to find interesting patterns and to identify the most important variables in predicting subjective wine quality.
I decided not to research known relationships between physicochemical properties or attempts by others at predicting wine quality, but rather to let the data speak for itself, in order to minimize bias. I will develop some models for predicting quality and at a later stage compare my findings with those of the original paper on this dataset by Cortez et al.
First I am going to look at the structure of the data, the data types and some basic descriptive statistics. I want to check for any missing data or other data quality issues.
Options have been set, now load packages.
Loading the helper functions. Used references [6], [7] and [8] in constructing these.
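The “Count … Share …” lines that appear later in this report could come from a small helper along these lines (a hypothetical sketch; the actual helper name and the code in references [6], [7] and [8] may differ):

```r
# Hypothetical sketch of a count/share helper used to report subset sizes,
# e.g. "Count: 215 of 4898, Share: 4.39%".
count_share <- function(n, total) {
  sprintf("Count: %d of %d, Share: %.2f%%", n, total, 100 * n / total)
}
```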
The data is contained in a file “wineQualityWhites.csv”.
This dataset is public, available for research. The details are described in:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
There are 11 physicochemical variables recorded from tests.
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
(Any information below on the physicochemical variables is from the Udacity project data description).
Associated with each wine is a quality evaluation: a score between 0 and 10. There are at least three ratings, and the median of these is taken (I assume an odd number of ratings, as only whole integers occur).
I want to find out any relationships between the physicochemical variables themselves and between them and the quality, and then the relative importance of the variables in a potential prediction model.
Loading the data and looking at the structure.
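A minimal load-and-inspect chunk might look like this (a sketch, assuming the CSV sits in the working directory):

```r
# Load the data; read.csv names the unnamed index column "X" automatically.
wine <- read.csv("wineQualityWhites.csv")
str(wine)
```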
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
Calculate statistics for each variable.
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1 Min. : 3.800 Min. :0.0800 Min. :0.0000
## 1st Qu.:1225 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700
## Median :2450 Median : 6.800 Median :0.2600 Median :0.3200
## Mean :2450 Mean : 6.855 Mean :0.2782 Mean :0.3342
## 3rd Qu.:3674 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900
## Max. :4898 Max. :14.200 Max. :1.1000 Max. :1.6600
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.600 Min. :0.00900 Min. : 2.00
## 1st Qu.: 1.700 1st Qu.:0.03600 1st Qu.: 23.00
## Median : 5.200 Median :0.04300 Median : 34.00
## Mean : 6.391 Mean :0.04577 Mean : 35.31
## 3rd Qu.: 9.900 3rd Qu.:0.05000 3rd Qu.: 46.00
## Max. :65.800 Max. :0.34600 Max. :289.00
## total.sulfur.dioxide density pH sulphates
## Min. : 9.0 Min. :0.9871 Min. :2.720 Min. :0.2200
## 1st Qu.:108.0 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100
## Median :134.0 Median :0.9937 Median :3.180 Median :0.4700
## Mean :138.4 Mean :0.9940 Mean :3.188 Mean :0.4898
## 3rd Qu.:167.0 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500
## Max. :440.0 Max. :1.0390 Max. :3.820 Max. :1.0800
## alcohol quality
## Min. : 8.00 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.40 Median :6.000
## Mean :10.51 Mean :5.878
## 3rd Qu.:11.40 3rd Qu.:6.000
## Max. :14.20 Max. :9.000
X just seems to be an identifier giving the row/observation number. The only other integer is the quality, which is known to be on an ordinal scale from 0 to 10.
The X column can be dropped.
The quality column has a data type of int, but it might be better analyzed as an ordered category. To get the best of both worlds, add a version of the quality column as an ordered factor.
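One way to add the ordered version (a sketch, assuming the data frame is called `wine`; using the observed levels gives the categories 3 to 9 seen later):

```r
# Keep the integer column and add an ordered factor with the observed levels.
wine$quality.cat <- factor(wine$quality,
                           levels = sort(unique(wine$quality)),
                           ordered = TRUE)
```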
There is at least one row where citric acid has a value of zero. This might actually be zero or a trace amount rounded down to zero. Check how many rows that is. If not too many, take a look at them.
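Counting and inspecting these rows is straightforward (assuming `wine` is loaded):

```r
# How many wines report exactly zero citric acid?
sum(wine$citric.acid == 0)

# The count is small, so inspect the rows directly.
wine[wine$citric.acid == 0, ]
```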
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 116 5.5 0.485 0 1.5 0.065
## 301 6.1 0.410 0 1.6 0.063
## 303 6.1 0.410 0 1.6 0.063
## 781 6.0 0.395 0 1.4 0.042
## 863 4.8 0.340 0 6.5 0.028
## 865 4.8 0.330 0 6.5 0.028
## 891 5.0 0.310 0 6.4 0.046
## 1153 6.1 0.600 0 1.3 0.042
## 1818 7.2 0.500 0 0.8 0.034
## 2322 4.6 0.445 0 1.4 0.053
## 2630 5.8 0.600 0 1.3 0.044
## 3455 5.8 0.540 0 1.4 0.033
## 3572 5.9 0.655 0 5.6 0.033
## 3636 5.8 0.580 0 1.5 0.020
## 4299 5.6 0.260 0 10.2 0.038
## 4598 6.7 0.660 0 13.0 0.033
## 4780 6.0 0.590 0 0.8 0.037
## 4793 4.7 0.785 0 3.4 0.036
## 4878 5.9 0.540 0 0.8 0.032
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 116 8 103 0.99400 3.63 0.40
## 301 36 87 0.99140 3.27 0.67
## 303 36 87 0.99140 3.27 0.67
## 781 7 55 0.99135 3.37 0.38
## 863 33 163 0.99390 3.36 0.61
## 865 34 163 0.99370 3.35 0.61
## 891 43 166 0.99400 3.30 0.63
## 1153 24 79 0.99370 3.31 0.38
## 1818 46 114 0.99320 3.19 0.34
## 2322 11 178 0.99426 3.79 0.55
## 2630 72 197 0.99202 3.56 0.43
## 3455 40 107 0.98918 3.26 0.35
## 3572 8 31 0.99360 3.32 0.51
## 3636 33 96 0.98918 3.29 0.38
## 4299 13 111 0.99315 3.44 0.46
## 4598 32 75 0.99551 3.15 0.50
## 4780 30 95 0.99032 3.10 0.40
## 4793 23 134 0.98981 3.53 0.92
## 4878 12 82 0.99286 3.25 0.36
## alcohol quality quality.cat
## 116 9.7 4 4
## 301 10.8 6 6
## 303 10.8 6 6
## 781 11.2 4 4
## 863 9.9 6 6
## 865 9.9 5 5
## 891 9.9 6 6
## 1153 9.4 4 4
## 1818 9.2 4 4
## 2322 10.2 5 5
## 2630 10.9 5 5
## 3455 12.4 5 5
## 3572 10.5 4 4
## 3636 12.4 6 6
## 4299 12.4 6 6
## 4598 10.7 6 6
## 4780 10.9 4 4
## 4793 13.8 6 6
## 4878 8.8 5 5
There is nothing about the other values that suggests this is a special group of some kind, so I’ll simply accept the values are valid: no citric acid.
Back to the set of statistics. There are some large maximum values compared to the medians: residual.sugar, chlorides, free.sulfur.dioxide, and to a lesser degree total.sulfur.dioxide. I’ll look closer at these in the univariate analysis.
Need to check whether there are any missing values.
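A one-liner covers this (assuming `wine` is loaded):

```r
# Count missing values per column; all zeros means no NA anywhere.
colSums(is.na(wine))
```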
## fixed.acidity volatile.acidity citric.acid
## 0 0 0
## residual.sugar chlorides free.sulfur.dioxide
## 0 0 0
## total.sulfur.dioxide density pH
## 0 0 0
## sulphates alcohol quality
## 0 0 0
## quality.cat
## 0
Nothing missing, so on with the plotting.
To get a feel for the values, create a view of the first rows.
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.0 0.27 0.36 20.7 0.045
## 2 6.3 0.30 0.34 1.6 0.049
## 3 8.1 0.28 0.40 6.9 0.050
## 4 7.2 0.23 0.32 8.5 0.058
## 5 7.2 0.23 0.32 8.5 0.058
## 6 8.1 0.28 0.40 6.9 0.050
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 45 170 1.0010 3.00 0.45 8.8
## 2 14 132 0.9940 3.30 0.49 9.5
## 3 30 97 0.9951 3.26 0.44 10.1
## 4 47 186 0.9956 3.19 0.40 9.9
## 5 47 186 0.9956 3.19 0.40 9.9
## 6 30 97 0.9951 3.26 0.44 10.1
## quality quality.cat
## 1 6 6
## 2 6 6
## 3 6 6
## 4 6 6
## 5 6 6
## 6 6 6
I can use the granularity shown to zoom in on the plots. I will make a quick visualization with the helper function, then zoom in. I will try to see the finest level of detail to check for patterns.
Most acids involved with wine are fixed or nonvolatile (do not evaporate readily).
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
The tiny spikes in between the values at 0.1 granularity indicate some values have another decimal place. Look at the individual values.
##
## 3.8 3.9 4.2 4.4 4.5 4.6 4.7 4.8 4.9 5 5.1 5.2 5.3 5.4 5.5
## 1 1 2 3 1 1 5 9 7 24 23 28 27 28 31
## 5.6 5.7 5.8 5.9 6 6.1 6.15 6.2 6.3 6.4 6.45 6.5 6.6 6.7 6.8
## 71 88 121 103 184 155 2 192 188 280 1 225 290 236 308
## 6.9 7 7.1 7.15 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8 8.1 8.2
## 241 232 200 2 206 178 194 123 153 93 93 74 80 56 56
## 8.3 8.4 8.5 8.6 8.7 8.8 8.9 9 9.1 9.2 9.3 9.4 9.5 9.6 9.7
## 52 35 32 25 15 18 16 17 6 21 3 11 2 5 4
## 9.8 9.9 10 10.2 10.3 10.7 11.8 14.2
## 8 2 3 1 2 2 1 1
There are a few x.x5 values that fall midway between two x.x values, e.g. 6.1, 6.15, 6.2. The frequencies of even numbers are higher in general than those of odd numbers and the x.x5 values do not make up the difference. So there are some rounding effects here or some effect of the equipment output. If this occurs in other variables it won’t be commented on further. There is no reason to standardize/round the values, as this would just remove information. Decided to leave them as they are.
The amount of acetic acid in wine; at too-high levels it can lead to an unpleasant, vinegar taste. But we don’t know what “too high” is.
Only a slight skew to the right. View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
Another possibility is to examine the ratio of fixed to volatile acidity.
Found in small quantities, citric acid can add ‘freshness’ and flavor to wines.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
It seems that one value occurs far more often than its position in the distribution would suggest. Again, let’s look at the values.
##
## 0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14
## 19 7 6 2 12 5 6 12 4 12 14 1 19 17 27
## 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29
## 23 33 27 49 48 70 66 104 83 181 136 219 216 282 223
## 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39 0.4 0.41 0.42 0.43 0.44
## 307 200 257 183 225 137 177 134 122 101 117 82 95 37 63
## 0.45 0.46 0.47 0.48 0.49 0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59
## 46 51 38 39 215 35 25 23 16 19 11 22 13 21 6
## 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69 0.7 0.71 0.72 0.73 0.74
## 6 9 14 4 6 8 7 7 7 5 3 9 5 5 41
## 0.78 0.79 0.8 0.81 0.82 0.86 0.88 0.91 0.99 1 1.23 1.66
## 2 2 2 2 2 1 1 2 1 5 1 1
So there are 215 values of 0.49 compared to neighbours of 39 and 35. That’s an anomaly that merits investigation. Select the rows with this value and check the distribution of the other variables.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 5.600 Min. :0.0800 Min. :0.49 Min. : 0.900
## 1st Qu.: 6.800 1st Qu.:0.2000 1st Qu.:0.49 1st Qu.: 1.500
## Median : 7.400 Median :0.2500 Median :0.49 Median : 5.000
## Mean : 7.489 Mean :0.2629 Mean :0.49 Mean : 5.793
## 3rd Qu.: 8.000 3rd Qu.:0.3000 3rd Qu.:0.49 3rd Qu.: 8.100
## Max. :14.200 Max. :0.8500 Max. :0.49 Max. :23.500
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.02700 Min. : 3.00 Min. : 18.0
## 1st Qu.:0.03600 1st Qu.:21.50 1st Qu.:113.5
## Median :0.04400 Median :32.00 Median :138.0
## Mean :0.04558 Mean :33.61 Mean :141.2
## 3rd Qu.:0.05100 3rd Qu.:45.00 3rd Qu.:164.0
## Max. :0.23900 Max. :87.00 Max. :247.0
##
## density pH sulphates alcohol
## Min. :0.9893 Min. :2.850 Min. :0.2700 Min. : 8.50
## 1st Qu.:0.9928 1st Qu.:3.065 1st Qu.:0.3750 1st Qu.: 9.70
## Median :0.9940 Median :3.140 Median :0.4500 Median :10.50
## Mean :0.9943 Mean :3.163 Mean :0.4623 Mean :10.48
## 3rd Qu.:0.9956 3rd Qu.:3.240 3rd Qu.:0.5300 3rd Qu.:11.20
## Max. :1.0024 Max. :3.650 Max. :0.9800 Max. :13.00
##
## quality quality.cat
## Min. :4.000 3: 0
## 1st Qu.:5.000 4: 9
## Median :6.000 5: 54
## Mean :5.893 6:110
## 3rd Qu.:6.000 7: 36
## Max. :9.000 8: 5
## 9: 1
## [1] "Count: 215 of 4898, Share: 4.39%"
Nothing stands out from scanning the data. Perhaps there was for some wines a carefully controlled process for exactly this level of citric acid.
The amount of sugar remaining after fermentation stops. It’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet. The values are in grams per cubic decimeter, which is the same as grams per liter. How many sweet wines are there by this definition?
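The sweet wines can be pulled out with a simple filter (a sketch, assuming `wine` is loaded):

```r
# Wines above the 45 g/l "sweet" threshold.
subset(wine, residual.sugar > 45)
```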
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 2782 7.8 0.965 0.6 65.8 0.074
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## 2782 8 160 1.03898 3.39 0.69
## alcohol quality quality.cat
## 2782 11.7 6 6
Just one wine, with a value of 65.8, would be considered sweet by this definition, which seems improbable but may have something to do with how the wines for this data set were selected. According to Wikipedia, the following thresholds apply in the EU: dry: < 4 g/l; medium dry: 4–12 g/l; medium: 12–45 g/l; sweet: > 45 g/l.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
This is quite different from the other distributions, which are more peaked and symmetrical. There are some higher frequencies around 1 to 2.
It has the characteristics of a value that is being targeted, rather than a natural random variation. Or perhaps multiple distributions overlayed on each other, like particular brands of wine having ranges of sweetness.
It is rare to find wines with less than 1 g/l. How many of those are there?
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :4.500 Min. :0.1200 Min. :0.0000 Min. :0.6000
## 1st Qu.:5.900 1st Qu.:0.2000 1st Qu.:0.2700 1st Qu.:0.8000
## Median :6.500 Median :0.2400 Median :0.3100 Median :0.9000
## Mean :6.716 Mean :0.2857 Mean :0.2992 Mean :0.8442
## 3rd Qu.:7.200 3rd Qu.:0.3300 3rd Qu.:0.3600 3rd Qu.:0.9000
## Max. :9.200 Max. :0.9050 Max. :0.7400 Max. :0.9500
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01900 Min. : 3.00 Min. : 21.00
## 1st Qu.:0.03200 1st Qu.:12.00 1st Qu.: 72.00
## Median :0.03700 Median :22.00 Median : 95.00
## Mean :0.03984 Mean :25.69 Mean : 98.53
## 3rd Qu.:0.04200 3rd Qu.:36.00 3rd Qu.:126.00
## Max. :0.18500 Max. :89.00 Max. :204.00
##
## density pH sulphates alcohol
## Min. :0.9885 Min. :2.800 Min. :0.2800 Min. : 8.00
## 1st Qu.:0.9903 1st Qu.:3.080 1st Qu.:0.3500 1st Qu.: 9.80
## Median :0.9916 Median :3.210 Median :0.3900 Median :10.60
## Mean :0.9914 Mean :3.187 Mean :0.4181 Mean :10.58
## 3rd Qu.:0.9925 3rd Qu.:3.280 3rd Qu.:0.4700 3rd Qu.:11.40
## Max. :0.9936 Max. :3.540 Max. :0.7700 Max. :13.10
##
## quality quality.cat
## Min. :3.000 3: 1
## 1st Qu.:5.000 4:11
## Median :5.000 5:29
## Mean :5.377 6:31
## 3rd Qu.:6.000 7: 4
## Max. :8.000 8: 1
## 9: 0
## [1] "Count: 77 of 4898, Share: 1.57%"
77 out of 4898, ca. 1.6%, is indeed consistent with “rare”.
Maybe this data doesn’t follow a more “normal” distribution because of attempts to create a low residual sugar and keep the wine dry. It might be a candidate for transformation, but it seems more like two superimposed distributions than one skewed distribution. One distribution below 20, superimposed with one distribution below 2.
The amount of salt in the wine.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
A well-balanced distribution with a few large outliers.
The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Again, fairly well balanced, with a few large outliers.
How many wines have a concentration over 50 ppm? I used reference [2] to check the conversion to the units of the provided data. The conversion is 1:1.
In relation to the base unit of density (kilograms per cubic meter):
1 Part Per Million (ppm) = 0.001 kilograms-per-cubic-meter
1 Milligram Per Cubic Decimeter (mg/dm3) = 0.001 kilograms-per-cubic-meter.
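With a 1:1 conversion, the check reduces to a filter on the raw values (a sketch, assuming `wine` is loaded):

```r
# "Over 50 ppm" is simply free SO2 > 50 mg/dm^3.
over50 <- subset(wine, free.sulfur.dioxide > 50)
summary(over50)
nrow(over50)
```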
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. :4.200 Min. :0.1050 Min. :0.0000 Min. : 0.900
## 1st Qu.:6.400 1st Qu.:0.2200 1st Qu.:0.2700 1st Qu.: 5.675
## Median :6.700 Median :0.2600 Median :0.3400 Median : 9.200
## Mean :6.804 Mean :0.2716 Mean :0.3652 Mean : 9.027
## 3rd Qu.:7.300 3rd Qu.:0.3100 3rd Qu.:0.4400 3rd Qu.:12.600
## Max. :9.600 Max. :0.6950 Max. :1.2300 Max. :23.500
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 50.5 Min. : 97.0
## 1st Qu.:0.04075 1st Qu.: 54.0 1st Qu.:155.0
## Median :0.04600 Median : 58.0 Median :178.0
## Mean :0.05021 Mean : 61.5 Mean :179.8
## 3rd Qu.:0.05100 3rd Qu.: 64.0 3rd Qu.:201.2
## Max. :0.34600 Max. :289.0 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9874 Min. :2.860 Min. :0.2500 Min. : 8.000
## 1st Qu.:0.9942 1st Qu.:3.080 1st Qu.:0.4200 1st Qu.: 9.200
## Median :0.9959 Median :3.160 Median :0.4800 Median : 9.500
## Mean :0.9957 Mean :3.169 Mean :0.4999 Mean : 9.861
## 3rd Qu.:0.9976 3rd Qu.:3.230 3rd Qu.:0.5600 3rd Qu.:10.400
## Max. :1.0024 Max. :3.760 Max. :0.9900 Max. :13.500
##
## quality quality.cat
## Min. :3.000 3: 5
## 1st Qu.:5.000 4: 14
## Median :6.000 5:349
## Mean :5.705 6:390
## 3rd Qu.:6.000 7: 85
## Max. :9.000 8: 24
## 9: 1
## [1] "Count: 868 of 4898, Share: 17.72%"
So ca. 18% may have a noticeably bad odour.
It would be interesting to see if these wines are rated lower compared to the rest.
## Up to 50 has mean: 5.915136 Above 50 has mean: 5.705069
This is consistent with the idea that above 50 ppm the odour becomes noticeable, so quality ratings are lower. The counts are large enough to take the direction seriously, though a formal test would be needed to confirm the difference is significant.
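Such a test could be a Welch two-sample t-test (a hedged sketch, assuming `wine` is loaded):

```r
# Compare mean quality for wines at or below vs above 50 ppm free SO2.
high <- wine$free.sulfur.dioxide > 50
t.test(wine$quality[!high], wine$quality[high])
```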
Amount of free and bound forms of SO2.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Mostly symmetrical with a slight skew. No real need to transform.
Since Total Sulfur Dioxide includes Free and Bound forms a new variable can be created which is: Bound Sulfur Dioxide = Total Sulfur Dioxide - Free Sulfur Dioxide.
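In code (assuming `wine` is loaded):

```r
# Bound SO2 = total SO2 minus free SO2.
wine$bound.sulfur.dioxide <-
  wine$total.sulfur.dioxide - wine$free.sulfur.dioxide
```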
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 78.0 100.0 103.1 125.0 331.0
This variable could be used together with Free Sulfur Dioxide. Another possibility is to examine the proportion of free to bound sulfur dioxide as a variable.
The density of wine is close to that of water, depending on the percent alcohol and sugar content.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
Again, there are some values that were probably already rounded and so occur more frequently.
Describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
Compared to the description that “most wines are between 3 and 4”, this distribution sits somewhat closer to the 3 end, but otherwise there are no particular anomalies.
A wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
Mostly symmetrical with a slight skew. No real need to transform. The appearance of being bimodal is due to an unusually high number of 0.38 results. There is a higher number of 0.50 results which may be due to some rounding.
The percent alcohol content of the wine. Wine tends to be between 9 and 16%.
View a summary of the statistics.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
There seem to be two or three modes in the distribution. This may be caused by attempts to achieve a particular alcohol content. E.g. around 9.5, 10.5 and 12.5. Some peaks may also be due to rounding, e.g. 10.0, 10.5, 11.0, 12.0, 12.5.
Like residual sugar, it has the characteristics of a value that is being targeted, rather than a natural random variation.
Check the relationship between alcohol content and residual sugar. I believe that high alcohol corresponds with low residual sugar because more is used in the fermentation.
This is a distribution without outliers, which is probably good for the customers!
The subjective rated quality of the wine. The quality rating is the median of three independent ratings on a scale of 0 to 10.
Confirm that the ratings are whole integers.
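Both checks fit in two lines (assuming `wine` is loaded):

```r
# Summary statistics of the quality score.
summary(wine$quality)

# Frequency of each rating; only whole integers appear as values.
table(wine$quality)
```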
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
Looking at the frequency of each category, there are only 5 wines with a 9 rating and 20 with a 3 rating. Those are small sample sizes.
Visualize those frequencies in a bar chart.
There are no 0s, 1s, 2s, or 10s. These are extremely bad or good ratings and may also be absent because the median of the scores of at least three assessors is used, which makes a 0 or 10 hard to achieve.
The distribution seems well-balanced with a range of values covering all usual quality levels, which will help find relationships between quality and the other variables. The mode (and median) is 6, so slightly in favour of better wines, but no issue.
There are 11 continuous variables representing measurements of various physicochemical properties, plus a variable representing the subjective assessment of the wine quality (using an ordinal discrete scale from 0 to 10). There is an additional variable (X) that is simply a row index and so was dropped.
There are 4898 rows of data, each representing a wine that was measured and assessed.
The quality variable is most important as the intention of the dataset is to enable quality to be predicted through the other variables.
All the other variables are potential explanatory variables. Most of the variables show a somewhat Gaussian distribution, which might imply that these are not particularly “steered” during the winemaking process.
Of main interest are the six explanatory variables that intuitively most likely influence taste:
- Residual Sugar, i.e. sweetness.
- Alcohol, i.e. strength of alcohol.
- Volatile Acidity, i.e. vinegar-like taste.
- Citric Acid, i.e. freshness and flavor.
- Free Sulfur Dioxide (evident in the “nose” and taste of wine at high concentrations).
- Chlorides, i.e. saltiness.
The other five features are of secondary interest, but their more extreme values need to be considered. Total Sulfur Dioxide includes “free” and “bound” forms, but intuitively it is the free form that impacts taste and smell. The bound form was calculated as an additional variable. Similarly, volatile acidity intuitively affects taste and smell more than fixed acidity.
Since Total Sulfur Dioxide includes Free and Bound forms a new variable was created which was:
Bound Sulfur Dioxide = Total Sulfur Dioxide - Free Sulfur Dioxide.
Residual Sugar and Alcohol show distributions that might be a result of specific interventions to achieve a specific classification. For example, residual sugar determines whether the wine can be labeled “dry”, “sweet”, etc.
Although there were a few extreme values in the data, these were left in for now. None of the distributions suggested there would be an advantage in transformations.
For the rest of the analysis, shorten the column names. They are:
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
## [13] "quality.cat" "bound.sulfur.dioxide"
to be shortened to:
## [1] "fixed" "volatile" "citric" "sugar" "chlorides"
## [6] "free.so2" "total.so2" "density" "pH" "sulphates"
## [11] "alcohol" "quality" "quality.cat" "bound.so2"
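The renaming itself is a single assignment (assuming the column order shown above):

```r
# Shorten the column names for more compact plots and tables.
names(wine) <- c("fixed", "volatile", "citric", "sugar", "chlorides",
                 "free.so2", "total.so2", "density", "pH", "sulphates",
                 "alcohol", "quality", "quality.cat", "bound.so2")
```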
Let’s look in detail at each relationship with quality.
The following relationships are interesting: quality.cat and each of the explanatory variables.
For each variable, look at the mean and distribution via a box plot, and if necessary, zoom in. Also look at summary statistics.
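The per-category tables below could be produced with dplyr along these lines (a hedged sketch; the t-based 95% confidence interval is my assumption, and the original helper may construct it differently):

```r
library(dplyr)

# Mean, a t-based 95% confidence interval, and median of a variable
# (here alcohol) per quality category.
wine %>%
  group_by(quality.cat) %>%
  summarise(mean     = mean(alcohol),
            upper_ci = mean + qt(0.975, n() - 1) * sd(alcohol) / sqrt(n()),
            lower_ci = mean - qt(0.975, n() - 1) * sd(alcohol) / sqrt(n()),
            median   = median(alcohol))
```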
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 10.3 10.9 9.77 10.4
## 2 4 10.2 10.3 10.00 10.1
## 3 5 9.81 9.85 9.77 9.5
## 4 6 10.6 10.6 10.5 10.5
## 5 7 11.4 11.5 11.3 11.4
## 6 8 11.6 11.8 11.4 12
## 7 9 12.2 13.4 10.9 12.5
Alcohol has a clear relationship with quality. The ‘dip’ in the middle is interesting. Maybe there are relationships for ‘above average’ and ‘below average’ quality wines and both involve an increase in alcohol away from the average.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 0.475 0.531 0.418 0.44
## 2 4 0.476 0.494 0.458 0.47
## 3 5 0.482 0.487 0.477 0.47
## 4 6 0.491 0.496 0.486 0.48
## 5 7 0.503 0.512 0.494 0.48
## 6 8 0.486 0.508 0.464 0.46
## 7 9 0.466 0.581 0.351 0.46
There is a linear trend in the means from 5 to 7 that is not visible in the medians; otherwise the relationship is uncertain.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 3.19 3.29 3.09 3.22
## 2 4 3.18 3.21 3.16 3.16
## 3 5 3.17 3.18 3.16 3.16
## 4 6 3.19 3.19 3.18 3.18
## 5 7 3.21 3.22 3.20 3.2
## 6 8 3.22 3.24 3.20 3.23
## 7 9 3.31 3.41 3.21 3.28
A linear relationship from 5 to 7, uncertain with other categories.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 0.995 0.996 0.994 0.994
## 2 4 0.994 0.995 0.994 0.994
## 3 5 0.995 0.995 0.995 0.995
## 4 6 0.994 0.994 0.994 0.994
## 5 7 0.992 0.993 0.992 0.992
## 6 8 0.992 0.993 0.992 0.992
## 7 9 0.991 0.995 0.988 0.990
Zoom in on the box plot.
This is a clear linear relationship for 5 and above, slightly disrupted by category 4.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 53.3 85.8 20.8 33.5
## 2 4 23.4 26.5 20.2 18
## 3 5 36.4 37.4 35.5 35
## 4 6 35.7 36.3 35.0 34
## 5 7 34.1 35.0 33.2 33
## 6 8 36.7 39.1 34.3 35
## 7 9 33.4 50.1 16.7 28
Zoom in on the box plot.
No clear relationship, but it is interesting that low levels relate to category 4, because it is high levels of free SO2 that are supposed to be detectable.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 117. 150. 84.8 106
## 2 4 102. 108. 95.4 102
## 3 5 114. 116. 113. 114
## 4 6 101. 103. 100.0 97
## 5 7 91.0 92.8 89.1 86
## 6 8 89.4 93.4 85.5 84
## 7 9 82.6 110. 55.3 82
A similar pattern to density, again with category 4 not agreeing with the linear relationship.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 171. 221. 120. 160.
## 2 4 125. 133. 117. 117
## 3 5 151. 153. 149. 151
## 4 6 137. 139. 135. 132
## 5 7 125. 127. 123. 122
## 6 8 126. 131. 121. 122
## 7 9 116 141. 91.4 119
Since free sulfur dioxide showed no particular pattern over the quality categories, the pattern here in total is driven by the bound part. So total and bound look similar and are highly correlated.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 0.0543 0.0760 0.0326 0.041
## 2 4 0.0501 0.0541 0.0461 0.046
## 3 5 0.0515 0.0529 0.0502 0.047
## 4 6 0.0452 0.0461 0.0444 0.043
## 5 7 0.0382 0.0389 0.0375 0.037
## 6 8 0.0383 0.0403 0.0364 0.036
## 7 9 0.0274 0.0366 0.0182 0.031
Zoom in on the box plot.
Chlorides show a clear linear relationship with quality.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 6.39 8.88 3.90 4.6
## 2 4 4.63 5.27 3.98 2.5
## 3 5 7.33 7.61 7.06 7
## 4 6 6.44 6.66 6.23 5.3
## 5 7 5.19 5.47 4.90 3.65
## 6 8 5.67 6.31 5.04 4.3
## 7 9 4.12 8.79 -0.548 2.2
Zoom in on the box plot.
Again, some linear relationship from 5 to 7, disrupted by category 4 and to some extent 8.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 0.336 0.374 0.298 0.345
## 2 4 0.304 0.330 0.279 0.290
## 3 5 0.338 0.345 0.330 0.32
## 4 6 0.338 0.343 0.333 0.32
## 5 7 0.326 0.331 0.320 0.31
## 6 8 0.327 0.339 0.314 0.32
## 7 9 0.386 0.488 0.284 0.36
Zoom in on the box plot.
The differences between categories are minimal and 4 again is an exception.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 0.333 0.399 0.267 0.26
## 2 4 0.381 0.408 0.354 0.32
## 3 5 0.302 0.307 0.297 0.28
## 4 6 0.261 0.264 0.257 0.25
## 5 7 0.263 0.269 0.257 0.25
## 6 8 0.277 0.294 0.261 0.26
## 7 9 0.298 0.370 0.226 0.27
Again an interesting “dip” around 6 and 7. Given the confidence intervals, it could be argued that for 6 and above higher quality is not related to lower volatile acidity, but higher volatile acidity does relate to lower quality ratings. This could be an example where the effect of a variable is non-linear - high levels lead to poor quality, low levels do not lead to a higher quality. This might partition the results at around 0.26 volatile acidity.
## # A tibble: 7 x 5
## quality.cat mean upper_ci lower_ci median
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 7.6 8.41 6.79 7.3
## 2 4 7.13 7.30 6.96 6.9
## 3 5 6.93 6.98 6.89 6.8
## 4 6 6.84 6.87 6.80 6.8
## 5 7 6.73 6.78 6.68 6.7
## 6 8 6.66 6.78 6.53 6.8
## 7 9 7.42 8.64 6.20 7.1
This shows a clear linear relationship with quality.
An overview of just the correlations would be helpful. Let’s use corrgram.
The variables have been “effect-ordered” using principal component analysis of the correlation matrix, such that variables correlating are clustered together. See reference [4].
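The call might look like this (a sketch; `order = TRUE` triggers the PCA-based effect ordering, and the panel choices are my assumption):

```r
library(corrgram)

# Correlation overview, PCA-ordered so correlated variables cluster together.
corrgram(wine[sapply(wine, is.numeric)], order = TRUE,
         lower.panel = panel.shade, upper.panel = panel.cor)
```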
There are strong correlations between sugar and density (positive) and between alcohol and density (negative). The relationships between these three need to be explored.
Alcohol has the strongest positive correlation with quality: more is better. Most other variables have a negative correlation with quality: less is better.
There is a cluster of positive correlation around total so2, bound so2, sugar and density. Total so2 and the derived variable bound so2 are very strongly correlated with each other so that for the models I must choose between them.
I decided to continue with total.so2 since I can then compare the results of any models with results from others. It might be interesting in the future to see whether using the bound variable would improve model results.
Check interesting relationships between the original explanatory variables.
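In code (a sketch, assuming the shortened column names are in place):

```r
# Correlation matrix of the original explanatory variables.
cor(wine[, c("fixed", "volatile", "citric", "sugar", "chlorides",
             "free.so2", "total.so2", "density", "pH", "sulphates",
             "alcohol")])
```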
## fixed volatile citric sugar chlorides
## fixed 1.00000000 -0.02269729 0.28918070 0.08902070 0.02308564
## volatile -0.02269729 1.00000000 -0.14947181 0.06428606 0.07051157
## citric 0.28918070 -0.14947181 1.00000000 0.09421162 0.11436445
## sugar 0.08902070 0.06428606 0.09421162 1.00000000 0.08868454
## chlorides 0.02308564 0.07051157 0.11436445 0.08868454 1.00000000
## free.so2 -0.04939586 -0.09701194 0.09407722 0.29909835 0.10139235
## total.so2 0.09106976 0.08926050 0.12113080 0.40143931 0.19891030
## density 0.26533101 0.02711385 0.14950257 0.83896645 0.25721132
## pH -0.42585829 -0.03191537 -0.16374821 -0.19413345 -0.09043946
## sulphates -0.01714299 -0.03572815 0.06233094 -0.02666437 0.01676288
## alcohol -0.12088112 0.06771794 -0.07572873 -0.45063122 -0.36018871
## free.so2 total.so2 density pH sulphates
## fixed -0.0493958591 0.091069756 0.26533101 -0.4258582910 -0.01714299
## volatile -0.0970119393 0.089260504 0.02711385 -0.0319153683 -0.03572815
## citric 0.0940772210 0.121130798 0.14950257 -0.1637482114 0.06233094
## sugar 0.2990983537 0.401439311 0.83896645 -0.1941334540 -0.02666437
## chlorides 0.1013923521 0.198910300 0.25721132 -0.0904394560 0.01676288
## free.so2 1.0000000000 0.615500965 0.29421041 -0.0006177961 0.05921725
## total.so2 0.6155009650 1.000000000 0.52988132 0.0023209718 0.13456237
## density 0.2942104109 0.529881324 1.00000000 -0.0935914935 0.07449315
## pH -0.0006177961 0.002320972 -0.09359149 1.0000000000 0.15595150
## sulphates 0.0592172458 0.134562367 0.07449315 0.1559514973 1.00000000
## alcohol -0.2501039415 -0.448892102 -0.78013762 0.1214320987 -0.01743277
## alcohol
## fixed -0.12088112
## volatile 0.06771794
## citric -0.07572873
## sugar -0.45063122
## chlorides -0.36018871
## free.so2 -0.25010394
## total.so2 -0.44889210
## density -0.78013762
## pH 0.12143210
## sulphates -0.01743277
## alcohol 1.00000000
The following relationships are interesting (r is the correlation coefficient for the complete dataset):
1. density and residual sugar (r = 0.84)
2. alcohol and density (r = -0.78)
3. total.so2 and density (r = 0.53)
4. alcohol and residual sugar (r = -0.45)
5. total.so2 and alcohol (r = -0.45)
6. pH and fixed.acidity (r = -0.43)
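A small base-R sketch (toy data, not the report's code) of how such a ranked list of strongest pairs can be pulled out of a correlation matrix:

```r
# Sketch: extract the strongest pairwise correlations from a correlation
# matrix (toy 3-variable matrix standing in for the 11-variable wine matrix).
top_pairs <- function(cm, n) {
  cm[upper.tri(cm, diag = TRUE)] <- NA   # keep each pair only once
  idx <- order(abs(cm), decreasing = TRUE, na.last = NA)[seq_len(n)]
  data.frame(var1 = rownames(cm)[row(cm)[idx]],
             var2 = colnames(cm)[col(cm)[idx]],
             r    = cm[idx])
}
cm <- cor(data.frame(sugar   = c(1, 2, 3, 4, 5),
                     density = c(1.1, 2.0, 3.2, 3.9, 5.1),
                     pH      = c(3, 1, 4, 2, 5)))
top_pairs(cm, 1)  # the sugar-density pair comes out on top
```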
Since we now have a reasonable number of factors (6), let’s use pairs in the psych package to look at them.
The lower part shows bivariate 68% concentration ellipses and loess smoothing curves, so we can see where most of the data lies and how much slope there is in that region.
pH does not correlate with any variables except fixed acidity and that might be because pH itself is a scale of base vs acidity.
For each combination, we’ll generate two plots: one with the full data and one with a random sample of 500 points.
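The sampling step is a one-liner in base R (the data frame here is a stand-in for the wine data):

```r
# Reproducible random sample of 500 rows to reduce overplotting
# (toy data frame standing in for the wine data set).
set.seed(42)
wine <- data.frame(sugar = runif(5000), alcohol = runif(5000))
wine_sample <- wine[sample(nrow(wine), 500), ]
nrow(wine_sample)
```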
There is a rather extreme outlier in the data.
## fixed volatile citric sugar chlorides free.so2 total.so2 density pH
## 1 7.8 0.965 0.6 65.8 0.074 8 160 1.03898 3.39
## sulphates alcohol quality quality.cat
## 1 0.69 11.7 6 6
I will leave it in the dataset for now, on the assumption that all values are valid and verified. Given the size of the dataset, and since the outlier is not extremely far from the rest of the ‘cloud’ (so not necessarily a corrupt value), it shouldn’t skew the results too much. It also falls in category 6, where there is plenty of other data to reduce its effect.
There is a concentration ‘band’ of values at the lower end of the residual sugar range, just above zero. So, regardless of density, some effect causes these to be more frequent.
Strong linear relationship. The large outlier has already been examined.
Moderate linear relationship. The large outlier has already been examined.
This is a less natural relationship, as there is banding at the lower levels of sugar as it approaches zero. There may be a limit to how much sugar can be reduced. There are also many cases where both low alcohol and low sugar occur.
Moderate linear relationship with some asymmetry of the ‘cloud’.
Moderate linear relationship.
Of the explanatory variables, 9 show a clear relationship with the quality categories. The two that don’t are free so2 and citric acid. By “clear relationship” I mean I can see a consistent relationship with some of the categories, and by looking at the confidence intervals rather than the mean, I can imagine the relationship holding if the sample size were larger (of course, a confidence interval is not a prediction interval!). This visual approximation was made at the upper and lower categories (3, 4, 8, 9), which include fewer wines. Looking at the frequencies again:
##
## 3 4 5 6 7 8 9
## 20 163 1457 2198 880 175 5
There are only 5 wines in the 9 category and 20 in the 3 category.
Volatile acidity had a clear relationship for categories up to 6, then leveled out.
There were two effects that stood out during the investigation: 1. The quality category 4 results seem to be an exception for many of the variable relationships. This occurred for: alcohol, residual sugar, density, free so2, total so2, and citric acid. 2. There is a ‘band’ of values around the lower levels of residual sugar (nearing zero), which means they occurred more frequently in the data than the other residual sugar values. We already saw that residual sugar and alcohol have distributions that suggest an intervention taking place, either in the making of the wine or in the selection process of the dataset. There may have been some stratification in the selection process that ensures certain segments of wine were represented. There is no information on this available.
The scatterplot of total so2 and alcohol showed an asymmetrical regression line through the cloud. This can be investigated further in the multivariate section.
Visually, the strongest relationship with the quality categories was with alcohol and total sulfur dioxide. Between the explanatory variables, the strongest relationship was between density and residual sugar, closely followed by density and alcohol.
The relations between density, alcohol and residual sugar can be investigated further in the multivariate section.
Pick a color-blind friendly color scheme for coding the quality categories. Red means poor quality, blue means good quality. The 7 categories are thus:
So average ratings get less intense colors, while the more divergent ratings, good (blue) and poor (red), get the more intense colors.
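A base-R sketch of such a diverging scheme (the report may have used a ColorBrewer palette instead; the hex endpoints here are illustrative choices):

```r
# Diverging red-to-blue scheme: intense red for poor, muted middle,
# intense blue for good (endpoint colors are illustrative assumptions).
quality_colors <- colorRampPalette(c("#d73027", "#eeeeee", "#4575b4"))(7)
names(quality_colors) <- as.character(3:9)  # one color per quality category
quality_colors
```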
First look at alcohol and density. Since presence of alcohol lowers the density, put density on the y-axis.
The improvement of quality with reduced density and increased alcohol can be seen. Otherwise there is nothing particularly stratified about the pattern, which suggests that alcohol, not density, is decisive for quality perception. One could argue that there is a tendency for high quality wines (8 or 9) to be above the regression line, but it’s a small effect.
Alcohol should partition the data well.
Now look at alcohol and sugar. Sugar is fermented into alcohol, so put alcohol on the y-axis.
Poor quality is common at all residual sugar levels if the alcohol is low enough. I would expect the lower left corner to be empty if low residual sugar implied high alcohol. The converse combination, high alcohol and high residual sugar, is also not seen.
Again, alcohol should partition the data well. Or an interaction of alcohol and sugar.
Let’s now investigate the residual sugar-alcohol-density relationship further, to see whether residual sugar or alcohol has more influence on density.
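One way to sketch this comparison (toy data; the report's plots were likely built differently) is to slice density into quartile bands with `cut()`, so that the alcohol-sugar relationship can be examined within each band:

```r
# Slice density into four quartile bands; each band can then get its own
# alcohol-vs-sugar smoothing line (toy data standing in for the wine set).
set.seed(9)
wine <- data.frame(density = runif(1000, 0.987, 1.004))
wine$density.band <- cut(wine$density,
                         breaks = quantile(wine$density, probs = seq(0, 1, 0.25)),
                         include.lowest = TRUE)
table(wine$density.band)  # four roughly equal bands
```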
Here we see the relationship between alcohol and residual sugar gradually decreasing as the density increases. So sugar doesn’t affect density as much as alcohol, or there would be more slope in the lines with low alcohol/high density. Let’s swap alcohol and density.
Here we see the same relationship between alcohol and density holds consistently at all sugar levels.
I want to look at an example where category 4 did not follow the pattern of the other categories. The biggest disruptive effect was with residual sugar, alcohol, density and total so2. Since residual sugar, density and alcohol are related, I choose residual sugar (y) and total so2 (x). I’m going to facet on quality.
I didn’t see anything in category 4 that looks different from the other categories. Try again with alcohol and density.
Again, no clear reason. So the influences on category 4 are more to do with subtle shifts in distributions, which is picked up by the statistics.
Now let’s look at some of the other variables.
The better ratings tend to be with lower total so2.
I’m curious if free so2 looks any different.
There is a band of free so2 near 0 that seems to be rated particularly poorly. But then the scores with high total so2 are poor. One could split first by one variable then the other, but the total so2 would split more points and so would probably be more important in a partitioning model.
There seem to be better ratings at low fixed acidity. High volatile acidity is less clear, but low seems better. It’s hard to tell which one would partition better.
This one is more interesting. Low chloride levels (less saltiness) are associated with good ratings. There is a cluster around a pH of 3.2 with much larger chloride levels than the main ‘cloud’, and these wines are almost always rated poorly. Chlorides should partition well.
Finally, one of the relationships found earlier: pH and fixed acidity.
In general, high pH corresponds to low acidity, so the relationship makes sense. There is a tendency for good ratings to be with high pH and low fixed acidity.
From the visual analysis, it looks like alcohol, total so2, fixed acidity and chlorides would be important variables.
At this stage I am going to look at some simple prediction models to try to find out which variables most affect quality. I am going to model the quality in two ways. Although quality takes ordinal discrete values, given the large number of samples it is acceptable to treat it as a continuous variable. For some models, I will try to predict the categorical quality values.
I will use a 70%/30% split to generate a random training and test set. The more advanced models with caret will be set up to do n-fold cross-validation using the training set. Where appropriate I will repeat that cross-validation three times. For the continuous models I will look at metrics such as RMSE, R-Squared and MAE, and for the categorical models I will compare Accuracy and Kappa values.
First generate the training and test sets.
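A base-R sketch of the split (the report more likely used caret's `createDataPartition`, which stratifies on the outcome; the data frame here is a stand-in):

```r
# 70%/30% train/test split on row indices (toy outcome column standing in
# for the wine data; caret::createDataPartition would stratify on quality).
set.seed(2019)
wine <- data.frame(quality = sample(3:9, 4898, replace = TRUE))
train_idx <- sample(nrow(wine), size = floor(0.7 * nrow(wine)))
treeTrain <- wine[train_idx, ]
treeTest  <- wine[-train_idx, ]
c(train = nrow(treeTrain), test = nrow(treeTest))
```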
Now set up formulae for predicting the continuous and categorical quality.
## quality ~ fixed + volatile + citric + sugar + chlorides + free.so2 +
## total.so2 + density + pH + sulphates + alcohol
## quality.cat ~ fixed + volatile + citric + sugar + chlorides +
## free.so2 + total.so2 + density + pH + sulphates + alcohol
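The two formulas printed above can be built programmatically, for example with `reformulate` (a sketch; the report may simply have written them out by hand):

```r
# Build the continuous and categorical model formulas from one predictor list
# (variable names as they appear in the printed formulas above).
predictors <- c("fixed", "volatile", "citric", "sugar", "chlorides",
                "free.so2", "total.so2", "density", "pH", "sulphates",
                "alcohol")
formula_continuous  <- reformulate(predictors, response = "quality")
formula_categorical <- reformulate(predictors, response = "quality.cat")
```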
Using the continuous formula, start simple with a rpart model (Recursive Partitioning And Regression Trees).
First ignore the penalties and see how the complexity parameter (cp) varies with reducing the relative error.
Visually, the optimum appears to be around 5, so I added a reference line at that value. Optimization should stop at around 5 nodes. Now let the caret package decide.
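A sketch of this inspection (rpart ships with R; the toy data here stands in for `treeTrain` and the report's formula, and the cp table is plotted against tree size by hand):

```r
library(rpart)
# Toy data with a real signal so the tree actually grows.
set.seed(1)
toy <- data.frame(alcohol = rnorm(300))
toy$quality <- 5.9 + 0.8 * toy$alcohol + rnorm(300, sd = 0.5)
fit <- rpart(quality ~ alcohol, data = toy)
cptab <- fit$cptable
plot(cptab[, "nsplit"] + 1, cptab[, "xerror"], type = "b",
     xlab = "tree size (leaves)", ylab = "cross-validated relative error")
abline(v = 5, col = "red", lty = 2)  # reference line at the visual optimum
```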
It decided on 6. Let’s see what variables it selected.
##
## Regression tree:
## rpart(formula = formula_continuous, data = treeTrain)
##
## Variables actually used in tree construction:
## [1] alcohol free.so2 volatile
##
## Root node error: 2716.7/3428 = 0.7925
##
## n= 3428
##
## CP nsplit rel error xerror xstd
## 1 0.163503 0 1.00000 1.00120 0.025151
## 2 0.048399 1 0.83650 0.84349 0.024057
## 3 0.030740 2 0.78810 0.79671 0.023538
## 4 0.018273 3 0.75736 0.77869 0.022864
## 5 0.010407 4 0.73908 0.75077 0.021718
## 6 0.010000 5 0.72868 0.74924 0.021697
It selected alcohol, free so2 and volatile acidity as the split variables. That’s a bit surprising, but we are making big splits in the data, which are hard to see when viewing the variables two at a time.
Let’s visualize the major tree splits. Plot the left branch of the decision tree.
The first decision is to cover the area below the horizontal line as it has poorer quality. The area to the left of the left vertical line has better quality, then the area between the vertical lines, then the area to the right of the right vertical line.
Now plot the right branch of the decision tree.
The first decision is to cover the area above the lower horizontal line as it has better quality. The area to the left of the vertical line has poorer quality, and for the area to the right of the vertical lines, quality is better above the top horizontal line.
So we can track the decisions roughly on the graphics.
It was interesting that it made a decision based on free so2 and not total so2.
Let’s see how this simple model performs. First make a set of predictions with the test data.
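The prediction step looks like this (a sketch with toy data; the report predicts `treeTest` with the fitted rpart model):

```r
library(rpart)
# Fit on a training portion and predict the held-out rows (toy data
# standing in for the wine training and test sets).
set.seed(4)
toy <- data.frame(alcohol = rnorm(500))
toy$quality <- 5.9 + 0.3 * toy$alcohol + rnorm(500, sd = 0.5)
train <- toy[1:350, ]
test  <- toy[351:500, ]
fit   <- rpart(quality ~ alcohol, data = train)
preds <- predict(fit, newdata = test)  # one numeric prediction per test row
summary(preds)
```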
Let’s look at the range of predictions, because the splits so far are distinguishing good/bad areas around the middle. Compare predictions to observations. First the predictions statistics:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.257 5.547 5.926 5.877 6.209 6.616
Then statistics for the observations being predicted (in the test set):
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 5.00 6.00 5.87 6.00 8.00
We are not predicting beyond categories 5 and 6.
## 3 4 5 6 7 8 9
## 7 53 415 698 243 54 0
Categories 3, 4, 8 and 9 may suffer from low sample counts, but we should at least reach category 7.
Calculate the RMSE metric from the observations and predictions.
## [1] 0.7485636
The predictions are off by roughly 0.75 on average, which might seem quite good, but the majority of predictions only distinguish between 5, 6 and 7.
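The metric itself is one line of base R (toy vectors here; in the report, the arguments are the test-set observations and the rpart predictions):

```r
# Root mean squared error between observed and predicted quality.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse(c(5, 6, 7, 5), c(5.5, 5.9, 6.2, 5.3))
```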
Let’s try tuning the model using the rpart method in caret. We’ll do 10-fold cross-validation with 3 repeats. Instead of tuning over the default 3 values of cp, we’ll try 30.
## CART
##
## 3428 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3086, 3085, 3085, 3085, 3085, 3085, ...
## Resampling results across tuning parameters:
##
## cp RMSE Rsquared MAE
## 0.002397130 0.7571325 0.2899568 0.5916011
## 0.002436190 0.7569029 0.2901679 0.5915401
## 0.002468413 0.7566983 0.2902438 0.5916062
## 0.002606896 0.7562768 0.2893742 0.5935173
## 0.002683425 0.7559366 0.2894841 0.5933004
## 0.002892613 0.7542470 0.2907379 0.5931311
## 0.003009863 0.7532875 0.2915859 0.5927112
## 0.003090272 0.7535723 0.2907528 0.5937174
## 0.003241063 0.7531557 0.2908492 0.5937972
## 0.003269676 0.7533839 0.2903486 0.5941263
## 0.003430740 0.7541194 0.2884547 0.5951529
## 0.003561024 0.7540492 0.2880643 0.5950557
## 0.003714169 0.7537236 0.2883697 0.5947478
## 0.003737597 0.7540729 0.2877677 0.5947709
## 0.004422305 0.7537121 0.2867248 0.5961655
## 0.004515881 0.7540622 0.2860141 0.5964520
## 0.004709184 0.7553321 0.2835749 0.5974940
## 0.004981479 0.7557710 0.2826124 0.5986878
## 0.005011718 0.7559164 0.2823196 0.5989794
## 0.005653559 0.7579347 0.2781485 0.6010130
## 0.006130020 0.7613270 0.2713719 0.6034467
## 0.006149680 0.7613411 0.2713524 0.6034156
## 0.006351919 0.7624079 0.2690498 0.6045826
## 0.007046163 0.7640770 0.2654774 0.6067103
## 0.007103542 0.7641887 0.2652305 0.6067072
## 0.010407236 0.7671678 0.2588612 0.6120345
## 0.018273243 0.7737322 0.2464031 0.6224251
## 0.030739657 0.7860451 0.2221236 0.6320496
## 0.048398871 0.8027803 0.1887521 0.6520621
## 0.163503257 0.8602109 0.1329783 0.6835411
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was cp = 0.003241063.
Optimal RMSE was about 0.75, in line with the rpart package results.
Visualize the model.
Splits at the higher level are very similar to the rpart model. Again, major decision points are with alcohol, volatile acidity and free so2.
Plot the importance of the variables as determined by the model.
Chlorides ranks second in importance but appears only once in the tree, and very low down. Volatile is also ranked lower than expected. Evidently the variables used for splits and the importance ranking do not need to line up. I need to research that later.
Would a simple linear regression model do better?
## Linear Regression
##
## 3428 samples
## 11 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3084, 3086, 3086, 3084, 3085, 3086, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.7591005 0.2744523 0.5923417
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
No practically significant difference from the rpart model.
What does this model think are important variables?
Linear model variable importance
Again a surprise as total so2 virtually disappears and alcohol is pushed right down the list.
I ran the lm without any preprocessing. But lm often runs better with standardized variables (subtract mean from each value and divide by the standard deviation). So run the same model with standardization.
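Standardization itself is simple in base R (caret applies it via its “center” and “scale” preprocessing options):

```r
# Center and scale: subtract the mean, divide by the standard deviation.
x <- c(2, 4, 6, 8)
z <- (x - mean(x)) / sd(x)
all.equal(z, as.vector(scale(x)))  # the built-in scale() gives the same result
```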
## Linear Regression
##
## 3428 samples
## 11 predictor
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3084, 3085, 3086, 3085, 3085, 3085, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.7595102 0.2731548 0.5925265
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The standardization made no practically significant difference.
Since it is linear regression, we may have an issue with multicollinearity. Run a check. It’s usual to take out columns that correlate over a cutoff of 0.70.
## [1] "Predictors over 0.70 correlation cutoff: density"
In this case that is density, which we already know correlates strongly with alcohol. Run the model again without density. Set up the new formula.
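A minimal base-R version of such a cutoff check (caret's `findCorrelation` is the more careful tool, as it also decides which member of each pair to drop; the toy matrix mimics the sugar-density correlation):

```r
# Flag predictors involved in any pairwise correlation above 0.70
# (toy 3x3 matrix standing in for the full predictor correlation matrix).
cm <- matrix(c(1.00, 0.84, 0.10,
               0.84, 1.00, 0.20,
               0.10, 0.20, 1.00), nrow = 3,
             dimnames = list(c("sugar", "density", "pH"),
                             c("sugar", "density", "pH")))
high <- which(abs(cm) > 0.70 & upper.tri(cm), arr.ind = TRUE)
flagged <- unique(colnames(cm)[high[, "col"]])
flagged
```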
## quality ~ fixed + volatile + citric + sugar + chlorides + free.so2 +
## total.so2 + pH + sulphates + alcohol
Now run the model.
## Linear Regression
##
## 3428 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3085, 3085, 3085, 3085, 3084, 3086, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.767057 0.2584003 0.596313
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
No practically significant difference for the model performance.
Instead of choosing myself, add a Principal Component Analysis to the preprocessing. This will keep the components above a variance threshold (leave the default 0.95). The components will be selected to best capture 95% of the variance. (Density is back in).
## Linear Regression
##
## 3428 samples
## 11 predictor
##
## Pre-processing: centered (11), scaled (11), principal component
## signal extraction (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3086, 3086, 3086, 3085, 3085, 3085, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.7660879 0.2613879 0.5960826
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Again no practically significant improvement. The PCA kept all 11 components, so it sees no redundant explanatory variables.
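For reference, this is how a variance threshold selects components (a base-R sketch with toy data; caret's `preProcess` with `method = "pca"` does the equivalent):

```r
# Keep the smallest number of principal components whose cumulative share
# of variance reaches 95% (toy data: four independent noise columns, which
# share no variance, so all four components are kept).
set.seed(7)
toy <- as.data.frame(matrix(rnorm(200 * 4), ncol = 4))
p <- prcomp(toy, center = TRUE, scale. = TRUE)
cum_var <- cumsum(p$sdev^2) / sum(p$sdev^2)
n_keep  <- which(cum_var >= 0.95)[1]
n_keep
```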
Going back to the model without density, look at the variable importance.
Reduced standardized linear model variable importance
Now look at what the recursive partitioning in caret determined.
Recursive partitioning model variable importance
The variable importance in the linear model is now closer to what the partitioning determined.
Now alcohol is back on the top of the lm list. The rest is still quite mixed and it is unclear why important variables in the tree model did not appear high in the tree decision hierarchy.
Neither model is particularly good. There are a number of other models and optimizations that could be attempted.
So let’s try a Support Vector Machine (SVM) model in caret. We’ll start by predicting the quality categories, so we switch to accuracy as a metric. A tuneLength of 9 is typical, fitting 9 values of the cost parameter in the default grid.
## Support Vector Machines with Linear Kernel
##
## 3428 samples
## 11 predictor
## 7 classes: '3', '4', '5', '6', '7', '8', '9'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3085, 3086, 3084, 3086, 3084, 3087, ...
## Resampling results:
##
## Accuracy Kappa
## 0.5079888 0.1823607
##
## Tuning parameter 'C' was held constant at a value of 1
Accuracy is poor. We can look at the confusion matrix to see what it means.
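At its core the confusion matrix is a cross-tabulation (a toy sketch; `caret::confusionMatrix` adds the per-class statistics):

```r
# Cross-tabulate predicted vs observed quality categories and compute
# overall accuracy (toy factors using the report's seven levels).
obs  <- factor(c(5, 5, 6, 6, 6, 7), levels = 3:9)
pred <- factor(c(5, 6, 6, 6, 5, 6), levels = 3:9)
cm <- table(Prediction = pred, Reference = obs)
accuracy <- sum(diag(cm)) / sum(cm)  # correct predictions sit on the diagonal
accuracy
```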
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 5 38 249 133 13 4 0
## 6 2 15 166 565 230 50 0
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5537
## 95% CI : (0.5279, 0.5794)
## No Information Rate : 0.4748
## P-Value [Acc > NIR] : 8.239e-10
##
## Kappa : 0.2346
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7 Class: 8
## Sensitivity 0.000000 0.00000 0.6000 0.8095 0.0000 0.00000
## Specificity 1.000000 1.00000 0.8171 0.4003 1.0000 1.00000
## Pos Pred Value NaN NaN 0.5633 0.5496 NaN NaN
## Neg Pred Value 0.995238 0.96395 0.8385 0.6991 0.8347 0.96327
## Prevalence 0.004762 0.03605 0.2823 0.4748 0.1653 0.03673
## Detection Rate 0.000000 0.00000 0.1694 0.3844 0.0000 0.00000
## Detection Prevalence 0.000000 0.00000 0.3007 0.6993 0.0000 0.00000
## Balanced Accuracy 0.500000 0.50000 0.7085 0.6049 0.5000 0.50000
## Class: 9
## Sensitivity NA
## Specificity 1
## Pos Pred Value NA
## Neg Pred Value NA
## Prevalence 0
## Detection Rate 0
## Detection Prevalence 0
## Balanced Accuracy NA
As we saw with the simple tree model, predictions are limited to categories 5 and 6 and don’t extend into the other categories. That’s why the accuracy is so low.
SVM linear model variable importance
Here the variable importance is given for each quality category (so X7 means quality category 7). The importance of chlorides is seen for category 7. Alcohol and free so2 are prominent across most categories.
We can try non-linear “radial” basis functions. This will attempt to fit non-linear relations. There are many variables that do not hold a linear relationship across all categories.
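The radial kernel replaces the linear inner product with a Gaussian similarity; in kernlab's parameterization (which caret's svmRadial method uses) it is k(x, y) = exp(-sigma * ||x - y||^2). A sketch:

```r
# Radial basis function kernel, kernlab-style parameterization; the default
# sigma here is the value caret settled on above (an assumption for illustration).
rbf <- function(x, y, sigma = 0.0763) exp(-sigma * sum((x - y)^2))
rbf(c(1, 2), c(1, 2))  # identical points have similarity 1
rbf(c(1, 2), c(3, 4))  # similarity decays with squared distance
```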
## Support Vector Machines with Radial Basis Function Kernel
##
## 3428 samples
## 11 predictor
## 7 classes: '3', '4', '5', '6', '7', '8', '9'
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3083, 3086, 3085, 3085, 3085, 3085, ...
## Resampling results across tuning parameters:
##
## C Accuracy Kappa
## 0.25 0.5420078 0.2534430
## 0.50 0.5517200 0.2736790
## 1.00 0.5579473 0.2891171
##
## Tuning parameter 'sigma' was held constant at a value of 0.07628414
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.07628414 and C = 1.
The accuracy seems higher. Let’s see how that impacts the confusion matrix.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 3 4 5 6 7 8 9
## 3 0 0 0 0 0 0 0
## 4 0 1 0 0 0 0 0
## 5 5 40 258 125 9 1 0
## 6 2 11 156 538 168 36 0
## 7 0 1 1 35 66 17 0
## 8 0 0 0 0 0 0 0
## 9 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.5871
## 95% CI : (0.5614, 0.6124)
## No Information Rate : 0.4748
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.321
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 3 Class: 4 Class: 5 Class: 6 Class: 7
## Sensitivity 0.000000 0.0188679 0.6217 0.7708 0.27160
## Specificity 1.000000 1.0000000 0.8294 0.5168 0.95599
## Pos Pred Value NaN 1.0000000 0.5890 0.5906 0.55000
## Neg Pred Value 0.995238 0.9646018 0.8479 0.7138 0.86889
## Prevalence 0.004762 0.0360544 0.2823 0.4748 0.16531
## Detection Rate 0.000000 0.0006803 0.1755 0.3660 0.04490
## Detection Prevalence 0.000000 0.0006803 0.2980 0.6197 0.08163
## Balanced Accuracy 0.500000 0.5094340 0.7255 0.6438 0.61380
## Class: 8 Class: 9
## Sensitivity 0.00000 NA
## Specificity 1.00000 1
## Pos Pred Value NaN NA
## Neg Pred Value 0.96327 NA
## Prevalence 0.03673 0
## Detection Rate 0.00000 0
## Detection Prevalence 0.00000 0
## Balanced Accuracy 0.50000 NA
The non-linear SVM model was better than the linear model. You can see from the confusion matrix that it starts to form a more diagonal pattern, with the inclusion of predictions for category 7.
Generate the variable importance.
SVM non-linear model variable importance
OK, that looks the same as for the previous linear model, which suggests the importance calculation is independent of the model method (for models without a built-in importance measure, caret falls back to a model-free, filter-based one).
Finally, to compare with the other models, let’s run the non-linear SVM on the quality column - not as categories but as a continuous variable. We can obtain the RMSE value and compare to the other models.
## Support Vector Machines with Radial Basis Function Kernel
##
## 3428 samples
## 11 predictor
##
## Pre-processing: centered (11), scaled (11)
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 3085, 3086, 3086, 3085, 3086, 3084, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 0.7194641 0.3498053 0.5455965
## 0.50 0.7140075 0.3594573 0.5398859
## 1.00 0.7081665 0.3699813 0.5351721
##
## Tuning parameter 'sigma' was held constant at a value of 0.07834557
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.07834557 and C = 1.
Generate a summary to look at the range of the predicted values, which was an issue with the early rpart models.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.301 5.400 5.855 5.830 6.231 7.370
Whereas the first rpart model predictions did not extend beyond categories 5 and 6, the new model is now extending into the quality categories 4 and 7.
Look at the variable importance.
Continuous SVM non-linear model variable importance
Now compare the 6 caret models we created for predicting the quality as a continuous variable.
Models with continuous quality as target:
RPT rpart
LM linear regression
LMS linear regression standardized
LMSR linear regression standardized reduced (no density)
LMSP linear regression standardized principal component analysis
SVRQ non-linear SVM (radial)
##
## Call:
## summary.resamples(object = results_continuous)
##
## Models: RPT, LM, LMS, LMSR, LMSP, SVRQ
## Number of resamples: 30
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPT 0.5494183 0.5762993 0.5891019 0.5937972 0.6094715 0.6479804 0
## LM 0.5499537 0.5704551 0.5935185 0.5923417 0.6065380 0.6439286 0
## LMS 0.5537672 0.5773168 0.5917375 0.5925265 0.6072782 0.6299780 0
## LMSR 0.5593763 0.5857704 0.5963897 0.5963130 0.6117304 0.6342780 0
## LMSP 0.5485558 0.5815662 0.5916593 0.5960826 0.6101454 0.6518739 0
## SVRQ 0.4981283 0.5163948 0.5343199 0.5351721 0.5508056 0.5948575 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPT 0.6873955 0.7360141 0.7526879 0.7531557 0.7755424 0.8110713 0
## LM 0.6996858 0.7296200 0.7595643 0.7591005 0.7850522 0.8428440 0
## LMS 0.7030907 0.7375116 0.7634454 0.7595102 0.7769783 0.8444220 0
## LMSR 0.7042386 0.7545738 0.7658895 0.7670570 0.7850160 0.8267510 0
## LMSP 0.6969380 0.7482190 0.7587367 0.7660879 0.7884206 0.8315144 0
## SVRQ 0.6545835 0.6855503 0.7056025 0.7081665 0.7302604 0.7770454 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RPT 0.2002407 0.2605949 0.2904932 0.2908492 0.3112415 0.3735017 0
## LM 0.1843671 0.2467907 0.2701046 0.2744523 0.2944929 0.3604742 0
## LMS 0.1971150 0.2556266 0.2678179 0.2731548 0.2938054 0.3608088 0
## LMSR 0.2103202 0.2348679 0.2513786 0.2584003 0.2833495 0.3458324 0
## LMSP 0.1609149 0.2400387 0.2594913 0.2613879 0.2774192 0.3522588 0
## SVRQ 0.3113475 0.3379534 0.3733088 0.3699813 0.3943314 0.4472066 0
Looks like the SVM model was significantly better on all counts. Visualize the values with confidence intervals.
Continuous model metric comparison
No question that the SVM model is the best so far.
Finally, let’s compare the 2 caret models we created for the quality as a categorical variable.
Models with categorical quality as target:
SVML linear SVM
SVMR non-linear SVM (radial)
##
## Call:
## summary.resamples(object = results_categorical)
##
## Models: SVML, SVMR
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVML 0.4635569 0.4959891 0.5094625 0.5079888 0.5171666 0.5584795 0
## SVMR 0.5218659 0.5423360 0.5583090 0.5579473 0.5724845 0.6064140 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## SVML 0.1062522 0.1596202 0.1883400 0.1823607 0.1985738 0.2640654 0
## SVMR 0.2246130 0.2658121 0.2912582 0.2891171 0.3085541 0.3687203 0
The non-linear model performs better than the linear model. Visualize the values with confidence intervals.
Categorical model metric comparison
Again we see visually that for the categorical models, the non-linear model performs better than the linear model.
Changes in density seem to be mainly driven by changes in alcohol levels rather than changes in residual sugar levels. We see the relationship between alcohol and residual sugar gradually decreasing as the density increases. So sugar doesn’t affect density as much as alcohol.
There was a tendency for lower total so2 and lower fixed acidity to relate to higher quality.
Low chlorides (experienced as saltiness) relate to higher quality.
Poor quality is common for all residual sugar levels if the alcohol is low enough. We don’t see high alcohol and high residual sugar combinations.
So there are ‘dry’ wines with low residual sugar AND low alcohol. Perhaps either the alcohol or the residual sugar was removed from the wine? Either way, these combinations are rated poorly, so why would one make that effort? Alternatively, the fermentation quickly ran out of sugar and stopped: perhaps this is a consequence of trying to create a dry wine by starting with less sugar and overdoing it, so the wine doesn’t ferment enough alcohol to get a better rating.
Moderately sweet wines can get a good rating if there is enough alcohol. There are a few good wines with low alcohol and moderate residual sugar, but these are the exception. In general, better ratings correspond to more alcohol content.
I started with a simple rpart partitioning model and then looked at improving this via models using the caret package. In addition to a new cross-validated rpart model, I looked at linear models with standardized variables and with a high correlating variable (density) removed.
The caret rpart model (RMSE 0.75, MAE 0.59) was slightly better than the best linear model (RMSE 0.77, MAE 0.60).
I then generated Support Vector Machine models in two forms: a categorical prediction and a continuous variable prediction. For the categorical prediction I tried a linear and a non-linear model.
The SVM using continuous variable performed better than the rpart and linear models. I achieved an RMSE of 0.71 and an MAE of 0.54.
For the categorical prediction, the non-linear (radial) SVM performed better than the linear SVM.
After the models were generated I compared the performance metrics of my models with those of Cortez et al. in their paper.
They ran SVM on quality as a continuous variable. They achieved an accuracy of 64.6% with SVM (using a tolerance of 0.5 to be comparable with my model), compared to 58.6% for my non-linear SVM. Their Kappa was 43.9 compared to my 32.0. They reported an MAE of 0.45 (they call it MAD) compared to my 0.54.
So their results were better.
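For clarity, the tolerance-based accuracy used in this comparison counts a continuous prediction as correct when it lies within the tolerance of the true score (a sketch with toy numbers):

```r
# Accuracy with tolerance: a prediction within `tol` of the observed
# quality counts as correct (tolerance 0.5, as in the comparison above).
tol_accuracy <- function(obs, pred, tol = 0.5) mean(abs(obs - pred) <= tol)
tol_accuracy(c(5, 6, 7), c(5.4, 6.6, 7.2))  # two of three within tolerance
```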
The paper states that the data comes from Vinho Verde, a Portuguese wine. According to the Wikipedia entry on Vinho Verde, “The white Vinho Verde is very fresh, due to its natural acidity, with fruity and floral aromas that depend on the grape variety. The white wines are lemon- or straw-coloured, around 8.5 to 11% alcohol, and are made from local grape varieties Loureiro, Arinto, Trajadura, Avesso, and Azal. Vinho Alvarinho is made from Alvarinho grapes, from a small designated subregion of Monção and Melgaço. It has more alcohol (11.5 to 14%) and ripe tropical aromas.”
The distribution of Alcohol values might therefore be overlaying two distributions, one for Vinho Verde “classic” with a range of 8.5% to 11% and one for Vinho Alvarinho with a range of 11% to 14%. But this is speculation and there is no data to distinguish these in the dataset.
Relationship between Alcohol, Residual Sugar and Quality in White Wine
I was surprised to see so many data points at low levels of sugar and alcohol. The quality was rated poorly for these combinations. Poor quality is common at all residual sugar levels if the alcohol is low enough, and there are very few high alcohol and high residual sugar combinations.
So there are ‘dry’ wines with low residual sugar AND low alcohol. Perhaps either the alcohol or the residual sugar was removed from the wine? Either way, these combinations are rated poorly, so why would one make that effort? Alternatively, the fermentation quickly ran out of sugar and stopped, perhaps as a consequence of starting with too little sugar in an attempt to create a dry wine and overdoing it: the wine does not ferment enough alcohol to earn a better rating.
Moderately sweet wines can get a good rating if there is enough alcohol. There are a few good wines with low alcohol and moderate residual sugar, but these are the exception. In general, better ratings correspond to more alcohol content.
Recursive partitioning and regression tree using ‘rpart’ package
The first model, made using the recursive partitioning package rpart, showed the importance of alcohol in that it determined the first split. The other variables were somewhat more surprising, because visual analysis of the scatterplots seemed to show that fixed acidity (rather than volatile acidity) and total SO2 (rather than free SO2) would lead to a clearer partition. Also, chlorides, which looked like a good candidate for partitioning, did not appear.
The likely reason is that the visual analysis sees both sides of the tree at once. If the data were split into the left and right branches and each subset visualized separately, the subsequent splits might show up more clearly.
Most important variables from recursive partitioning in ‘caret’ package
The rpart model in caret generated slightly better performance metrics than the model from the rpart package. Looking at the relative importance of the variables, it was surprising to see that volatile acidity was much less important in this model.
The importance of the variables in recursive partitioning is estimated by summing the reduction in the loss function attributed to each variable at each split, including the top competing variables. Perhaps the reason for this difference is that the caret model went a number of levels deeper, which influenced the cumulative loss-reduction contribution.
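Retrieving these importance scores from caret is a one-liner. A minimal sketch, assuming `rpart_caret` is the fitted `train()` object for the rpart model (the name is hypothetical):

```r
# Sketch: caret's variable importance for a fitted rpart model,
# scaled 0-100 by default; plot() gives a quick visual comparison.
library(caret)
imp <- varImp(rpart_caret)
print(imp)
plot(imp)
```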
An interesting aspect of this model is that chlorides become much more important, which was a finding from the visual analysis.
The final surprise occurred when reading the variable importance in the paper by Cortez et al. In their list, sulphates were the most important variable, whereas in this model, and in the visual analyses, sulphates were not considered important at all.
I was pleased with the way the plots worked out. Two of my three final plots were generated by packages and I could not find a way to customize them further, but I added captions. I used them because they were key results from this analysis.
The mean_box helper function was supposed to be a time-saver, but I had difficulties passing the arguments into group_by and the aesthetics, and there is probably a more elegant way of doing that. I burnt more time than I saved, but I learned a lot. I will use this function again.
I struggled with getting the right mixture of granularity and color for the multivariate plots.
I need to improve my knowledge of further optimizing models.
I did not achieve the model performance reported by Cortez et al. in their paper, but they conducted an exhaustive optimization of the hyperparameters and of the selected features that I did not have time for (and that would have made this report much longer!). Future work could include such optimization, as well as models created with packages other than caret.
The random sampling method that I used left some categories with low sample counts because I withheld 30% for a test set. But the caret package runs cross-validation to assess model performance, so it might be possible to use the whole dataset to create models and let caret assess them. There would be no held-out data to verify them, however, so there would be a risk of overfitting.
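That whole-data approach could be sketched as follows, relying on caret’s resampling estimates instead of a held-out test set. Assumptions: the full data frame is called `ww` and `quality` is the numeric response (both names are hypothetical).

```r
# Sketch: fit on the full data and read caret's cross-validated
# performance estimates, rather than splitting off a test set.
library(caret)

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

fit_all <- train(quality ~ ., data = ww,
                 method = "svmRadial", trControl = ctrl,
                 preProcess = c("center", "scale"))

fit_all$results  # resampled RMSE (and related metrics) per tuning setting
```

Repeated cross-validation keeps every observation in the training pool while still giving an out-of-fold performance estimate, though, as noted, there is no truly independent verification set.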
I had not used caret before and was surprised how quickly I could get to useful models. I would like to better understand why the lists of important variables vary so much between models. For example, Cortez et al. reported sulphates as their most important variable, which never came up in my analysis.
It was helpful not to read ahead or research the topic in advance, as it did not bias my analysis. There have been a lot of analyses with this dataset. Future work could include a meta-analysis of the work on this dataset to identify gaps and new approaches. I did get some help from reference [5] which had some example rpart package models with this same dataset.
Now I’m looking forward to a glass of wine…
[1] https://stackoverflow.com/questions/8317231/elegant-way-to-report-missing-values-in-a-data-frame
[3] https://stackoverflow.com/questions/28427572/manipulating-axis-titles-in-ggpairs-ggally
[4] https://stats.stackexchange.com/questions/26920/variable-ordering-using-pca
[5] Book: “Machine Learning with R” by Brett Lantz. ISBN 978-1-78216-214-8
[6] https://stackoverflow.com/questions/38220963/how-to-pass-a-variable-name-in-group-b
[7] https://stackoverflow.com/questions/41783396/passing-arguments-to-dplyr-summarize-function